
This is the single cell RNA sequencing (scRNAseq) data analysis tutorial, prepared for the scRNAseq days of the Bioinformatics in Neuroscience course, organised for the Neuroscience and Cognition master and other Utrecht University master programs. The script follows the well known vignette of the Seurat (v3) pipeline for pbmc, adapted to analyse the scRNAseq data of the adult human brain.
1. Gain the capacity to run a standard scRNAseq data analysis pipeline independently
2. Have the ability to interpret the results of single cell data analysis
The course used the CoCalc server: https://cocalc7.science.uu.nl/
1. What are the major considerations during the preprocessing steps?
2. How does dimensionality reduction help us understand the data?
3. What is a molecular signature? How can marker genes help us undertand cellular (and brain) function?
Some good practice
To keep everything ordered, make some folders for saving the data and the figures. You will receive a warning if you have already made these folders
dir.create("write/") dir.create("figures/")
Load the necessary libraries to run our pipeline based on the well-known Seurat pipeline. You can find detailed information on Seurat as well as tutorials on its website (https://satijalab.org/seurat/) or on Github (https://github.com/satijalab/seurat)
We are going to use Seurat v3, which is well established. Note that there is a recent beta release (v4) with additional functionalities
# Load the libraries library(Seurat,verbose = FALSE) library(ggplot2,verbose = FALSE) library(RColorBrewer,verbose = FALSE) library(useful,verbose = FALSE) library(readr,verbose = FALSE) library(hdf5r,verbose = FALSE) # check if the following are on library(dplyr,verbose = FALSE) library(stringr,verbose = FALSE)
We will use the 10x Genomics data generated from the adult human brain. Since 4M cells are too many, we are using a ~5k subset. The data is stored in two seperate files
# This contains the actual data as a cell x gene matrix. Each cell in this matrix contain information on how many reads we have for a given gene in a given cell count_data <- read_rds("../rawdata/A46_10X176_8_count_data.rds")
# the dimensions for a sanity check. The first dimension shows the number of features (in this case, genes) while the second indicates teh number of cells dim(count_data)
# This contains the metadata. Each column contains information for each cell in teh dataset. metadata <- read_rds('../rawdata/A46_10X176_8_metadata.rds')
# what do we have in the metadata? colnames(metadata)
Here, we will create a "seurat object" which has a different structure than a data table. Seurat object contains the data as a matrix in a separate slot, but can also have 'features' that are linked to genes or cells. For instance cell types for each cell can be stored in a separate slot. In addition, we can store additional information e.g. results of differential expression test or coordinates for each cell after a given dimensionality reduction (e.g. pca)
'min.cells' is to filter genes: only genes detected in min of these number of cells will be kept 'min.features' is to filter cells: only cells that have a minimum total of given number of features will be kept Thus, these two provide initial quality control of the data
dataset <- CreateSeuratObject(counts = count_data, min.cells = 1, min.features = 1000, meta.data = metadata, project = "Human_brain")
visualise the number of genes and features, which are already provided in the dataset
head(dataset@meta.data)
| orig.ident | nCount_RNA | nFeature_RNA | Age | Class_auto.annotation | Clusters | Donor | Neuropeptide_auto.annotation | Neurotransmitter_auto.annotation | Roi | SampleID | Subtype_auto.annotation | Supercluster | Tissue | n_genes | n_counts | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <fct> | <dbl> | <int> | <dbl> | <chr> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | |
| 10X176_8:TCGTCCATCGCCTAGG | 10X176 | 2794 | 1729 | 50 | NK | 2 | H18.30.002 | nan | nan | Human A46 | 10X176_8 | nan | Miscellaneous | Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A46 | 1729 | 2794 |
| 10X176_8:TCACACCGTGTCTTAG | 10X176 | 2192 | 1367 | 50 | MGL | 8 | H18.30.002 | IGF | nan | Human A46 | 10X176_8 | nan | Microglia | Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A46 | 1367 | 2192 |
| 10X176_8:TCGACGGGTTGCATTG | 10X176 | 3368 | 2010 | 50 | MGL | 8 | H18.30.002 | IGF | nan | Human A46 | 10X176_8 | nan | Microglia | Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A46 | 2010 | 3368 |
| 10X176_8:GAACACTAGTATGGCG | 10X176 | 3871 | 2116 | 50 | MGL | 8 | H18.30.002 | IGF | nan | Human A46 | 10X176_8 | nan | Microglia | Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A46 | 2116 | 3871 |
| 10X176_8:GTGCAGCTCGACCAAT | 10X176 | 3729 | 2124 | 50 | MGL | 8 | H18.30.002 | IGF | nan | Human A46 | 10X176_8 | nan | Microglia | Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A46 | 2124 | 3729 |
| 10X176_8:GACAGCCAGTGCACCC | 10X176 | 3288 | 1898 | 50 | MGL | 8 | H18.30.002 | IGF | nan | Human A46 | 10X176_8 | nan | Microglia | Cerebral cortex (Cx) - Middle frontal gyrus (MFG) - A46 | 1898 | 3288 |
The first visual quality control is to look at the number of cells (nCount_RNA) and number of genes (nFeature_RNA) per cells. These features are calculated while generating the Seurat object. For instance, the number of RNA molecules per cell is stored as 'nCount_RNA', which is located in dataset@meta.data$nCount_RNA
plot1 <- VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA"),pt.size = .01) plot2 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") plot1 | plot2
What do you see? :
1. How do the number of genes and RNA molecules correlate?
2. There are two populations visible on the violin plot... What do these mean?
Answer whenever you want!
set a name that will be used later for the files you save. Anything goes. You can add your own name
name = "my_analysis_name"
mRNA coding for the Ribosomal subunit proteins is abundant (not to be confused with rRNA). Usually, a high ribosomal RNA percentage indicates production of a lot of proteins, and is very high in dividing cells or some secretory cells that need to constantly produce proteins. However, if most of the mRNA (>30-50%) that we detect is ribosomal, it means that we wont learn much from this cell, and that we should exclude it from the analysis.
Note that we are analysing single nuclei data. We have the following question:
1. How do you think this will effect our quality control?
2. In this case, what does high ribosomal/mitochondrial percentage mean?
Spend 1 min to think about the questions. Then, talk to the person next to you about what you think. Finally, summarise it in one sentence
# Start by generating QC metrics additional to the no of genes/features dataset <- PercentageFeatureSet(dataset,pattern='^MT-', col.name = "percent.mt") #Select all features starting with MT-, which are the Mitochondrial genes dataset <- PercentageFeatureSet(dataset,pattern='RP(S|L)', col.name = "percent.ribo") # Select all that starts with RPS or RPL
Plot and inspect. You can remove the ots in case they cause too much moise
options(repr.plot.width=12, repr.plot.height=6) VlnPlot(object = dataset, features = c("percent.mt", "percent.ribo"))
You can save them all as a pdf file to use in presentations, reports etc
pdf(paste0(name,"_Vlnplot_stats_all.pdf"),height = 6,width = 12) VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA","percent.mt", "percent.ribo")) # If there are too many points for your liking, you can turn them off as follows: #VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA","percent.mt", "percent.ribo"),pt.size = 0, cols = "blue") dev.off()
Visualise multiple metrics for each cell as a measure of the number of reads (in other words; UMI, or unique RNA molecules). This helps use to evaluate the quality of the data in multiple ways
plot1 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "percent.mt") plot2 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "percent.ribo") plot1 | plot2
Decide on the filtering parameters that you want to use. This is not always objective! However, you can see the 'outliers' in the data rather well
#Now filter dataset <- subset(x = dataset, subset = percent.mt < 5 & nCount_RNA < 90000 & percent.ribo < 2)
dataset
Now get rid of the mitochondrial genes which we do not need for downstream analysis.
dataset <- subset(x = dataset[-grep(pattern = "^MT-", x = rownames(dataset), value = FALSE)]) dataset
Standard flow of Seurat is replaced by a single command. However, it is good to see this part to learn each step
## normalise the data dataset <- NormalizeData(object = dataset, normalization.method = "LogNormalize", scale.factor = 10000)
The log transformed RNA counts from each cell is now normalised using a size factor. In other words, the sum of counts for each cells is more or less the same. We can now compare different cells to each other
## Here, we select the top 1,000 highly variable genes (Hvg) for downstream analysis. dataset <- FindVariableFeatures(object = dataset, selection.method = 'mean.var.plot', mean.cutoff = c(0.0125, 3), dispersion.cutoff = c(0.5, Inf)) length(x = VariableFeatures(object = dataset)) #3084 ## Identify the 10 most highly variable genes top10 <- head(VariableFeatures(dataset), 10)
## Plot # plot variable features with and without labels plot1 <- VariableFeaturePlot(dataset) plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE) #plot1 plot2
# Generate a vector with the top 1000 hvgens hv.genes <- head(rownames(HVFInfo(object = dataset)), 1000)
Scale the values for each features. The result is in units of 'standard deviation' where the mean expression is zero. If a gene is expressed 1 standard deviation more than the mean across all cells, it will have a value of 1. During this process, we can also 'regress out' technical effects. For instance, the number of features and the mitochondrial percentage.
dataset <- ScaleData(object = dataset, features = hv.genes, vars.to.regress = c("nFeature_RNA") )
You can run PCA, t-SNE and UMAP in simple one-line codes... The parameters are NOT random, but we will ignore this for the sake of time
# Dimention reduction dataset <- RunPCA(object = dataset, features = hv.genes, verbose = FALSE,npcs = 50)
dataset <- RunUMAP(object = dataset, reduction = "pca", dims = 1:50, verbose = FALSE) dataset <- RunTSNE(object = dataset, dims = 1:50, verbose = FALSE)
First, let's have a look at the first three dimensions. For this, we will take advantage of some of hte metadata generated by the authors: Supercluster information
options(repr.plot.width=12, repr.plot.height=6) DimPlot(dataset,cols = 'Spectral',group.by = 'Supercluster',pt.size = 1,reduction = 'pca',dims = c(1,2)) DimPlot(dataset,cols = 'Spectral',group.by = 'Supercluster',pt.size = 1,reduction = 'pca',dims = c(3,4))
1. What differences do you see between different principle components?
2. Some genes contribute more to the variation then others... What do these genes and principle components tell us?
Discuss with your neighbours, in groups of 3, for 2 min. We will go through the groups
You see the top genes that contribute to the respective principle components
print(dataset[["pca"]], dims = 1:5, nfeatures = 5)
VizDimLoadings(dataset, dims = 1:2, reduction = "pca")
options(repr.plot.width=12, repr.plot.height=9) DimHeatmap(dataset, dims = 1:4, cells = 500, balanced = TRUE, ncol = 2)
How many principle components do we need? There are some practical guidelines that will help with your analysis
We can do this with JackStrawPlot() function of Seurat. This is powerful, but time consuming. Running it altogether slows the server considerably. Also, we cannot visualise all different PCs (only up to 25). We will skip it for now (you can run it during the practice if you want to) and instead look at PCs using heatmaps
dataset <- JackStraw(object = dataset, num.replicate = 100) dataset <- ScoreJackStraw(object = dataset, dims = 1:20) JackStrawPlot(object = dataset, dims = 1:20)
We will plot the top genes from selected PCs and see how they differe between cells.
options(repr.plot.width=18, repr.plot.height=9) DimHeatmap(dataset, dims = c(1,2,5,10, 15,20,25,30, 35,40,45,50), cells = 250, balanced = TRUE, ncol = 4)
What do you see?
1. When do you think the variation is meaningful, and when not?
2. Some genes contribute more to the variation then others... What do these genes and principle components tell us?
3. What does it mean in terms of 'Superclusters'?
4.How would our results change if we include less or more principle components for downstream analysis?
Think for yourself for 2 min, then talk to your neighbors, in groups of 3, for 2 min. We will go through the groups
# Plots the standard deviations (or approximate singular values if running PCAFast) of the principle components for easy identification of an elbow in the graph. This elbow often corresponds well with the significant dims and speeds up running than Jackstraw # In this example, it looks like the elbow would fall around PC 11, then 25 and 35 again. After this, there is little change. options(repr.plot.width=12, repr.plot.height=6) ElbowPlot(object = dataset, reduction = "pca",ndims = 50)
1. What does it mean in terms of 'Superclusters'?
2. When do you think the variation in a PC is meaningful, and when not?
3.How would our results change if we include less or more principle components for downstream analysis?
Think for yourself for 2 min, then talk to your neighbors, in groups of 3, for 2 min. We will go through the groups
here, we will use graph-based clustering to identify groups of cells that are similar, which we will call 'clusters'
#SNN Graph Construction dataset <- FindNeighbors(object = dataset, dims = 1:35) #cell the pcs based on Jackstraw and Elbow plots #Find clusters using the louvain algorithm dataset <- FindClusters(object = dataset, resolution = 1) # changing the resolution will change the size/number of clusters! c(0.6, 0.8, 1)
Let's have a look at the QC stats agian. As default, this time the information will be displayed per cluster
# Statistics options(repr.plot.width=12, repr.plot.height=6) VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA"),pt.size = 0)
We provide some markers from the literature as a reference for identifying cell types. You can find more of these in Siletti et al 2022 Bioarxiv manuscript
ELAVL2, TUBB3: Neurons PLP1: Oligodendrocytes GFAP: Astrocytes ITGAM: Microglia PECAM1: Endothelial cells
options(repr.plot.width=18, repr.plot.height=12) ColorRamp <- colorRampPalette(rev(brewer.pal(n = 7,name = "RdYlBu")))(100) # This is how to generate a color key FeaturePlot(object = dataset, features = c("ELAVL2","TUBB3","PLP1","GFAP","ITGAM","PECAM1"),reduction = 'tsne',cols = ColorRamp, ncol =3)
Neurons are very diverse! We can use additonal markers to classify them functionally as 'Glutamatergic' and 'GABAergic'
SLC17A5, SLC17A6: Glutamatergic neurons GAD1, GAD2, SLC32A1: GABAergic neurons
options(repr.plot.width=18, repr.plot.height=12) ColorRamp <- colorRampPalette(rev(brewer.pal(n = 7,name = "RdYlBu")))(100) # This is how to generate a color key FeaturePlot(object = dataset, features = c("ELAVL2","GAD1","GAD2","SLC17A5","SLC17A6","SLC32A1"),reduction = 'tsne',cols = ColorRamp, ncol =3)
options(repr.plot.width=18, repr.plot.height=6) VlnPlot(object = dataset, features = c("ELAVL2","TUBB3","PLP1","GFAP","ITGAM","PECAM1"),pt.size = 0, ncol =2) #, slot = 'counts' # you can plot raw counts as well
options(repr.plot.width=18, repr.plot.height=6) VlnPlot(object = dataset, features = c("ELAVL2","GAD1","GAD2","SLC17A5","SLC17A6","SLC32A1"),pt.size = 0, ncol =2) #, slot = 'counts' # you can plot raw counts as well
You can use marker gene information to name each cluster as
cell classes e.g. oligodendrocytes, neurons or marker-defined cell types e.g. GlutamatergicNeuron_0
Your clusters are saved in the metadata, as part of the seurat object, with the name "seurat_clusters". We compare this to the metadata from the authors, which will help you decide which clusters belong to which cell type
plot21 <- DimPlot(object = dataset, reduction = 'tsne',group.by = "seurat_clusters") plot22 <- DimPlot(object = dataset, reduction = 'tsne',group.by = "Supercluster") plot23 <- DimPlot(object = dataset, reduction = 'tsne',group.by = "Clusters") #plot, adjusting the size of the graph options(repr.plot.width=18, repr.plot.height=6) plot21 | plot22 options(repr.plot.width=12, repr.plot.height=6) plot23
We can visualise the number of cells labeled with each label using the table() function
table(dataset@meta.data[['Supercluster']])
1. Why are there such big differences in the number of clusters between different analysis?
Shoot your answer!
To compare them, we will do a trick. First, generate a dataframe with teh information you need, then use the table() function of make a confusion matrix
df <- as.data.frame(cbind(dataset@meta.data[['Supercluster']],dataset@meta.data[['seurat_clusters']])) rownames(df) <- colnames(dataset) colnames(df) <- c('Supercluster','seurat_clusters') #Supercluster Clusters
head(table(df)[,1:10]) #head() automatically looks at teh first 6 rows. We also shorthen the columns to 10
Quick hetmap to visualise which cell types belong to which cluster
heatmap(table(df),xlab ='seurat_clusters')
1. What does the heatmap tell you?
Shoot your answer!
We often want to see what makes a cluster special, or different than the rest. You can achieve this here
cluster1.markers <- FindMarkers(object = dataset, ident.1 = 0, min.pct = 0.25) #, ident.2 = c(0, 3) to compare head(x = cluster1.markers, n = 5)
| p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | |
|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
| LAMA2 | 0 | 1.8374140 | 0.930 | 0.240 | 0 |
| LINC02296 | 0 | 1.5821047 | 0.909 | 0.220 | 0 |
| AC117453.1 | 0 | 1.1147041 | 0.826 | 0.146 | 0 |
| TXK | 0 | 0.4880059 | 0.713 | 0.100 | 0 |
| COL5A2 | 0 | 1.7736432 | 0.871 | 0.165 | 0 |
This is extremely useful to define what these cells are! However, you will see that a 'global comparison' is not always as accurate as you would expect
library(dplyr) dataset.markers <- FindAllMarkers(object = dataset, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25) # we have just calculated the marker genes for each cell type
We can select the top genes for each cluster. Here, we will select the top 2 genes
dataset.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC) %>% head(n=10) # Takes a bit of time
| p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene |
|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> | <chr> |
| 0.000000e+00 | 2.174604 | 0.933 | 0.192 | 0.000000e+00 | 0 | AJ009632.2 |
| 0.000000e+00 | 1.837414 | 0.930 | 0.240 | 0.000000e+00 | 0 | LAMA2 |
| 3.432075e-273 | 2.172005 | 0.980 | 0.302 | 1.264445e-268 | 1 | LINC02306 |
| 1.070594e-272 | 2.079003 | 1.000 | 0.472 | 3.944284e-268 | 1 | CBLN2 |
| 3.836338e-228 | 2.675527 | 0.943 | 0.252 | 1.413384e-223 | 2 | TSHZ2 |
| 7.190384e-154 | 2.421455 | 0.838 | 0.263 | 2.649081e-149 | 2 | AC109466.1 |
| 6.939798e-191 | 2.931752 | 1.000 | 0.629 | 2.556760e-186 | 3 | ERBB4 |
| 9.019916e-172 | 2.437787 | 0.997 | 0.614 | 3.323118e-167 | 3 | ZNF804A |
| 6.064442e-179 | 2.507001 | 0.890 | 0.260 | 2.234262e-174 | 4 | AC109466.1 |
| 1.731711e-160 | 2.440303 | 0.985 | 0.628 | 6.379970e-156 | 4 | CASC15 |
save the data as separate sheets in an excel file
library(openxlsx) ## Create Workbook object and add worksheets wb <- createWorkbook() ## Add the cluster info for ( n in unique(dataset.markers$cluster) ) { addWorksheet(wb, n) writeData(wb, n, dataset.markers[dataset.markers$cluster == n,], startCol = 1, startRow = 1, rowNames = TRUE, colNames = TRUE) } #save saveWorkbook(wb, paste(name,"_cell_clust_diff_genes.xlsx",sep=""), overwrite = TRUE)
In this case, we are plotting the top 20 markers (or all markers if less than 20) for each cluster.
dataset.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC) -> top10
options(repr.plot.width=12, repr.plot.height=12) #dataset <- ScaleData(object = dataset, features = rownames(top10$gene)) #,vars.to.regress = "percent.mt" DoHeatmap(object = dataset, features = top10$gene[1:100],slot = 'scale.data') + NoLegend()
What do you see? :
1. How do the number of genes and RNA molecules correlate?
2. There are two populations visible on the violin plot... What do these mean?
Answer whenever you want!
More information: https://satijalab.org/seurat/
https://satijalab.org/seurat/v3.1/visualization_vignette.html
small <- subset(dataset, idents = c('0','1'))
small
DimPlot(small, reduction = 'tsne')
You can proceed from step 3.2 to repeat the analysis of a subset of hte dataset that contains clusters 1 and 2